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PRINCIPAL COMPONENTS ANALYSIS IN THE SPACE OF 
PHYLOGENETIC TREES 

By Tom M. W. Nye 

Newcastle University 

Phylogenetic analysis of DNA or other data commonly gives rise 
to a collection or sample of inferred evolutionary trees. Principal 
Components Analysis (PCA) cannot be applied directly to collec- 
tions of trees since the space of evolutionary trees on a fixed set of 
taxa is not a vector space. This paper describes a novel geometri- 
cal approach to PCA in tree-space that constructs the first principal 
path in an analogous way to standard linear Euclidean PCA. Given 
a data set of phylogenetic trees, a geodesic principal path is sought 
that maximizes the variance of the data under a form of projection 
onto the path. Due to the high dimensionality of tree-space and the 
nonlinear nature of this problem, the computational complexity is 
potentially very high, so approximate optimization algorithms are 
used to search for the optimal path. Principal paths identified in this 
way reveal and quantify the main sources of variation in the original 
collection of trees in terms of both topology and branch lengths. The 
approach is illustrated by application to simulated sets of trees and 
to a set of gene trees from metazoan (animal) species. 

Introduction. Inference of evolutionary or phylogenetic trees is a funda- 
mental task in many areas of biology, and tree estimation has developed over 
several decades into a mature statistical field [13]. On a phylogenetic tree, 
leaves correspond to existing observed taxa, internal vertices correspond to 
ancestral taxa, and branch lengths represent the degree of evolutionary di- 
vergence between taxa. A phylogenetic tree representing the division and 
divergence of different species is called a species tree. However, individual 
regions of DNA can evolve according to trees that differ from the underlying 
species tree, and an inferred phylogenetic tree from a particular gene or DNA 
region is called a gene tree. Gene trees can differ from the species tree for 
several reasons: random variation in the process of DNA letter substitution; 
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population effects by which the evolutionary course of an individual gene 
does not match that of the species as a whole [10]; and even relatively rare 
events whereby genetic material is exchanged between species in a nontree- 
like manner [11]. Phylogenetic analysis of a number of different genes in 
a fixed set of species therefore generally gives rise to a collection of alter- 
native phylogenetic trees. Collections of alternative phylogenetic trees also 
arise from inferential methods that involve simulation: bootstrap replication 
and MCMC sampling from Bayesian posteriors are widely used in the con- 
struction of phylogenetic estimates. Given such a collection of alternative 
trees, whether gene trees or a simulated sample, identifying differences and 
quantifying variation is a difficult problem, since we might potentially have 
several hundred trees on thousands of species. Standard multivariate statis- 
tical methods such as clustering [8, 23, 28] and Multi-Dimensional Scaling 
(MDS) [8, 19] have been used to address this problem. Principal Components 
Analysis (PCA), in contrast, cannot be applied directly since the space of 
phylogenetic trees on a fixed set of species is not a Euclidean vector space. 
This paper describes a geometric approach to PCA for sets of alternative 
phylogenetic trees. The aim is to identify which tree features are most vari- 
able within a given set of trees and to quantify this variation — ^just as the 
first few components in regular PCA pick out the most variable features 
of a Euclidean data set. Although PCA has been used to analyze different 
phylogenetic data previously (such as distance matrix data), the method 
presented here is the first to work intrinsically within the space of phyloge- 
netic trees. The approach relies to a large extent on existing mathematical 
tools, and the main novel contribution comes from combining those elements 
into a computationally feasible method. 

A key feature of our approach is the incorporation of both topological 
and geometrical information from the trees under analysis, via the so-called 
geodesic metric on the space of trees [5, 22, 25]. Topological information 
refers to the exact pattern of branching within a tree, while geometrical in- 
formation refers to the distances between taxa induced by branch lengths 
on the tree. Topological features are generally more straightforward to char- 
acterize in a set of alternative trees, by counting the proportion of trees 
containing a given feature. For example, bootstrap replicate data sets are 
often represented by a single "consensus" tree annotated with a percent- 
age support for each clade within the tree [12]. However, the geometry and 
topology of evolutionary trees are intimately related: we can continuously 
change the topology of a tree by shrinking down the length of any internal 
branch and expanding out an alternative branch in its place, as shown in 
Figure 1. Recent authors [22] have stressed the importance of using geo- 
metrical information to draw comparisons between trees on account of the 
interdependence of tree geometry and topology, and due to the increased 
distinguishability obtained by using continuous rather than discrete metrics. 
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Fig. 1. A schematic view of a region of tree-space on five taxa: points m space correspond 
to unrooted trees. Trees with the same topology all lie in the same quadrant of tree-space 
(trees a;, it, e.g.). Different quadrants are joined along their edges. Tree x can be contin- 
uously deformed into tree z by shrinking down the branch DE via tree y and replacing 
it with the branch CD. It follows that tree z is obtained from x via nearest neighbor in- 
terchange (NNI) of the split ABC\DE into split ABE\CD. However, at y another NNI 
move is possible: ABC\DE could be replaced by ABD\CE , corresponding to the lower left 
quadrant. 

Moreover, tree geometry plays an important role in inference: it has been 
shown that long branches tend to "attract" each other, leading to mistakes 
in the topology of inferred trees [21]. 

Taking a set of alternative phylogenetic trees on some fixed set of taxa 
as input, our approach identifies a path L through tree-space that can be 
thought of analogously to the first principal component in regular Euclidean 
PCA. The path consists of a smoothly changing tree structure in which cer- 
tain branches expand or shrink. Alternative topologies emerge when internal 
branches are shrunk to have zero length and are then replaced with topolog- 
ically distinct branches. The path L is constructed in such a way that the 
changing features — both in terms of topology and geometry — correspond to 
the most variable features within the data set. Just as for regular PCA, L 
also captures correlations in the data set: features that tend to occur together 
in the data are also represented together on L. A quantitative measure of 
variability can be assigned to L, in analogy with the proportion of variance 
contributed by each component in regular PCA. Unlike Euclidean vector 
spaces, there is no inner product on tree-space, and so the analysis cannot 
be extended in a straightforward manner to provide higher order principal 
paths by working orthogonally to L. Further discussion is given in Section 6. 
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Our approach — which we will refer to as <I>PCA (for "phylogenetic" PCA) — 
is motivated by geometrical analogy with regular vector space PCA. Con- 
struction of the first principal component in a Euclidean vector space can 
be thought of as follows: 

(1) Given a set of vectors xi,. . . ,Xn identify the centroid x. 

(2) For a fixed line L through x take the orthogonal projection of the 
points xi, . . . ,Xn onto L. 

(3) Identify the line that maximizes the variance of the projected points 
along L, or, equivalently, which minimizes the sum of squared orthogonal 
distances of the points from the line. 

For a Euclidean vector space, these steps can be re-expressed and solved 
in terms of simple linear algebra. However, the space of phylogenetic trees 
on a fixed set of taxa is not a Euclidean vector space, so these steps can- 
not be applied directly in the same way to sets of alternative phylogenies. 
Tree-space can be equipped with various metrics that allow geometry to be 
performed, and for reasons described below, we use the geodesic metric [5]. 
$PCA then follows a similar set of steps to those above, but working with 
the geodesic metric, d{-,-). In step (2), the lines L become paths in tree- 
space with the property that, for any pair of points on the path, the path 
coincides with the geodesic between the points. Trees xi,...,Xn are "pro- 
jected" onto each path by finding points yi on the path that minimize the 
distance d{xi, yi) for i = 1, . . . ,n. Pythagoras' theorem does not hold in tree- 
space, so in the analog of step (3), paths which maximize the variance can 
be different from paths which minimize the sum of squared distances. We 
consider searching for both types of paths. Step (3) is potentially excessively 
computationally demanding, and so we describe (i) a greedy algorithm for 
constructing optimal paths and (ii) a Monte Carlo optimization approach. 
The methods we propose for steps (2) and (3) form the novel contribution of 
this paper. "Projection" of points onto a geodesic path L in step (2) is rela- 
tively simple to perform using existing methods for computing the geodesic 
metric, but a detailed algorithm has not been given previously. Searching 
over the set of possible paths is more technically demanding. Consideration 
of this particular problem and the solutions we present appear to be entirely 
novel. 

The development of $PCA has been influenced by a recent paper by 
Wang and Marron [29] . Wang and Marron addressed a similar problem, de- 
veloping a form of PCA for data sets with a tree-like structure. In a second 
paper [3], they applied their method to sets of trees obtained from med- 
ical imaging data. In particular, their reformulation of PCA in terms of 
the geometrical steps specified above motivated the corresponding steps in 
$PCA. Other authors have also developed analogs of PCA in nonstandard 
geometries [14, 18], and Wang and Marron give an excellent overview of this 
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area of research [29] . However, it must be stressed that the method of Wang 
and Marron does not apply to sets of phylogenetic trees, and that "3>PCA 
is not simply a reworking of their approach. On account of the ostensible 
similarities between the approaches, we devote a section to explaining the 
relationship between them later in the paper. 

The remainder of the paper is structured in the following way. We first 
describe the geometry of tree-space and set up necessary notation and math- 
ematical background. Section 2 contains a description of the $PCA approach 
and proofs of its properties. We then explain more fully the relationship to 
the work of Wang and Marron, before evaluating <I>PCA on simulated sets 
of trees and a real set of gene trees from metazoan species. 

1. Background: The geometry of tree-space. 

1.1. Splits and vector representation of trees. We will work throughout 
with a fixed set of taxa O = {oi, . . . , Om] and the set of unrooted phyloge- 
netic trees To on O. Given a tree x € To, cutting any branch on x partitions 
the taxa into two unordered nonoverlapping sets. Such a partition is called 
a split, and splits are usually denoted X\X'^ where X C O and c denotes the 
complement in O. There are M = 2"^~^ — 1 possible (nonempty) splits of the 
set O, and the set of these is denoted S. It is crucial to note that arbitrary 
sets of splits do not generally correspond to valid tree topologies — a com- 
patibility condition must be satisfied. For example, if O = {A,B,C,D,E}, 
then the two splits {A, B}\{C, D, E} and {A,C}\{B,D,E} cannot both be 
represented on the same tree. 

Any tree x gTq can be regarded as a weighted set of compatible splits, 
where the weight assigned to each split is given by the length of the corre- 
sponding branch on x. We only consider trees with positive branch lengths. 
We write to denote the set of splits in x, and encapsulate the branch 
lengths via a function Xx'. S ^ M"*" defined by 



Tree-space To can then be embedded in in the following way. Take the 
standard basis of and associate each split p ^ S with a different basis 
vector Bp. Any tree x €To can then be associated uniquely with the vector 




branch length associated with p, 



ifpG Tx, 
otherwise. 



(1.1) 



Ax = y^Aa;(y>)ep. 

PG5 



In fact, it is convenient to abuse notation slightly and write p for the ba- 
sis vector ep, identifying each split directly with the corresponding vector 
in M*^. Equation (1.1) essentially associates every tree x with a vector of 
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branch lengths, but due to the compatibihty relations between splits, not 
every such vector corresponds to a tree. In fact, each tree contains at most 
2m — 3 splits, so as the number of taxa m increases, 2m — 3 ^ M and To 
becomes an increasingly sparse subset of M^. 

Since a collection rci, . . . of trees can be regarded as a set of vectors 
Aa;^ , . . . , Aa;„, why not just perform PCA on these vectors? In general, the 
principal components obtained in this way will not correspond to valid trees, 
and interpretation of the principal components becomes impossible. A form 
of PCA which operates intrinsically within To and which produces inter- 
pretable "components" is required. 

1.2. Decomposition of tree-space by topology. The geometry of To was 
first comprehensively studied in a paper by Billera et al. [5] , which included 
the definition and proof of existence of geodesies. Their description of To 
amounts to a decomposition into a set of overlapping component pieces, each 
piece corresponding to a different tree topology. In this section we recall 
aspects of this decomposition which are central to $PCA, most importantly 
for the definition of geodesies on To- 

The decomposition is easiest to understand by identifying To with its 
image under the embedding in M.'^ . Every tree in To contains the set of 
splits corresponding to terminal edges (those that end in a leaf), denoted 
'S'term C S. Siucc every tree contains every terminal split 

To = Span+{p :p e 5tcrm} x 7b,int, 

where span^ denotes the span of vectors with nonnegative weights, and 7o,int 
is the part of tree-space corresponding to internal splits. Next consider a sin- 
gle unrooted tree x which is fully resolved, by which we mean every internal 
vertex has exactly 3 neighbors. Let t denote the topology of x or, more 
precisely, the set of nonterminal splits t = Tx\ S'term- Since x is fully re- 
solved, it has m — 3 internal edges, so t contains m — 3 splits. The internal 
branch lengths of any tree with topology t are determined by a point in 
Qt = span_|_{p:j> G t}. We call Qt the topological orthant containing x, and 
it is isomorphic to the positive orthant of M"^"^. The faces of the orthant Qt 
correspond to trees that have some zero length branches. Such trees are not 
fully resolved or, in other words, some internal vertices have more than 3 
neighbors. This structure is illustrated in Figure 1. 

Tree-space To is formed from the union of the orthants Qt over all possible 
fully resolved topologies t: 

7b = Span_^{p:^»G 5'term} X IJ Qt- 

resolved 
topologies t 

The individual orthants Qt are stitched together along their faces, since 
each unresolved tree occurs on the face of more than one orthant. To un- 
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derstand how the orthants are stitched together in more detail, consider 
a point on the face of an orthant Qt at which a single branch length corre- 
sponding to a split p has been collapsed to zero. As illustrated in Figure 1, 
there are two ways in which this branch can be replaced with an alterna- 
tive, thereby obtaining a fully resolved tree with a different topology. Each 
(m — 4)-dimensional face of Qt is therefore identified with corresponding 
faces in two other orthants Qt' and Qt" ■ The operation illustrated in Figure 1 
is called Nearest Neighbor Interchange (or NNI); we say that topologies t' 
and t" are obtained by NNI of the split p within t. Faces of Qt with co- 
dimension greater than 1 will be contained in more than two other orthants. 
Later, we will need to deal with paths in To between such faces and so we 
need to extend the definition of NNI (which is usually taken as a relation- 
ship between strictly binary trees). Given a split p in a fixed tree, there are 
two or more subtrees hanging from each end of the associated edge in the 
tree. An extended NNI move (or XNNI) consists of swapping a subtree from 
one end of the branch with a subtree from the opposite end. This operation 
removes split p from the tree and replaces it with an incompatible split p' . 
On a binary tree this definition coincides with the standard definition of 
NNI [1], and XNNI includes all NNI moves. 

1.3. Geodesies and the geodesic metric. To can be equipped with met- 
rics via the embedding into described above. In particular, L2 norm 
on defines a metric: d2{x,y) = \Xx — Xy\2- However, such metrics are 
not intrinsic to tree-space. For example, when x and y have different topolo- 
gies, £^2 corresponds to the length of a straight line segment joining x to y 
through M.^'^ , but this line contains points outside the image of To under the 
embedding. 

Billera et al. [5] proved the existence of a metric that locally resembles 
the L2 metric, but which is intrinsic to To independent of the embedding 
in M^^. This metric is called the geodesic metric d, and it is the canonical 
metric for <&PCA due to its intrinsic nature. It is defined as follows. For 
two trees x and y with the same topology, d{x, y) = d2ix, y). When x and y 
have different topologies, d{x, y) is defined as the length of the shortest con- 
tinuous path joining x to y in To which consists of a series of straight line 
segments through any feasible sequence of topological orthants. The length 
of such a path is defined to be the sum of the Euclidean lengths of each 
constituent line segment. The shortest such path joining j; to y is called the 
geodesic between x and y. The proof that geodesies exist between points 
in To and that geodesies define a valid metric is given in [5]. As part of the 
proof, Billera et al. [5] showed that tree-space is CAT(O) [16]. This means 
that triangles in To are "skinny" in comparison to triangles in the Euclidean 
plane. More formally, given points x,y,z gTo, consider the triangle between 
points x',y',z' in the Euclidean plane with the same edge lengths, so that 
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Fig. 2. Geodesies in tree-space consist of line segments through different topological quad- 
rants. On five taxa there are 15 different quadrants, hut only three are shown above, each 
with a sketch of the corresponding topology. Each axis corresponds to the length of a dif- 
ferent split. The shaded region does not correspond to a valid quadrant since the splits 
AB\CDE and BE\ACD are incompatible. The geodesic between xi and yi passes through 
three quadrants, whereas the geodesic between X2 and 3/2 passes through just two quadrants. 
In this case the geodesic is the same as the cone path. 

d{x,y) = d{x',y'), etc. If j{t) is the path-length parameterized geodesic be- 
tween X and y and 7'(t) the corresponding geodesic in the Euchdean plane, 
then d{z,^{t)) < d{z','y'(t)) for all points 7(t) between x and y. 

Geodesies in To have the following properties. First, if x,y G To have the 
same topology t, then the geodesic joining them is the obvious Euclidean 
line segment in Qt. Second, when x and y have some but not all splits 
in common, the splits in the intersection H Ty are all included at every 
point along the geodesic. The length of the branch associated to p GT^OTy 
changes in the obvious linear way from Xxip) to \y{p)- Third, when x and y 
have different topologies, the geodesic may pass through other topological 
orthants than the two associated with x and y, as illustrated by Figure 2. 
This is the case for points xi and yi in the figure. However, trees along the 
geodesic only ever contain splits from TxUTy, albeit in different combina- 
tions. It follows that when x and y have different topologies, computing the 
geodesic distance d(x,y) is nontrivial. However, an efficient polynomial-time 
algorithm has been developed for constructing geodesies [25], and we use this 
algorithm to calculate distances in <I>PCA. 

A crucial feature of CAT(O) spaces is that paths which are everywhere 
locally geodesic are necessarily globally geodesic (see [25], Lemma 2.1). 
Geodesies like that between xi and yi in Figure 2 must therefore not "bend" 
as they cross between different orthants. For some pairs x,y, however, the 
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shortest path is given by collapsing branch lengths for splits in Tx \ Ty down 
to zero, so that the topology is then Tx n Ty followed by expanding out 
branch lengths in Ty \ Tx to obtain y. Any two points can be joined by such 
a path, and they are referred to as cone paths. In Figure 2 the cone path co- 
incides with geodesic for points X2 and 1/2; geodesic between xi,yi is clearly 
shorter than the cone path. 

2. Methods. 

2.1. Existence of principal paths. We now have the geometrical ingre- 
dients needed to define the $PCA procedure. <I>PCA seeks to construct 
a principal path from the set of To-hnes defined as follows. 

Definition 1. A path F in To is a To-Hne if: 

(i) every sub-path of T is the geodesic between its endpoints, and 

(ii) r extends to infinity in two directions. 

We will often just use the term line to mean a 7o-hne where the context is 
obvious. Results in [5] show that any geodesic can be extended into a 7o-hne 
(though often not uniquely). The following proposition establishes existence 
and uniqueness of closest points on lines. 

Proposition 2.1. Given a To-Hne L and a point x £ To, there is 
a unique closest point y £ L to x. 

Proof. The proof relies mainly on the CAT(O) property to enable 
comparison with Euclidean space. Let xq be any point on L and sup- 
pose L{t) is a path-length parameterization of L such that L{0) = xq. Defin- 
ing r = d{xQ,x), consider the triangle xq,x, L{t) for some t>r. The "skinny" 
triangle property implies that 

d{x,L{r)) < d{x,L{t)). 

The same bound applies to L{—t). The closest point y £ L, if it exists, must 
therefore lie on L{t) for t G [— r, r]. Since this is a compact set and since the 
geodesic distance is a continuous function, d{x,L(t)) achieves its minimum 
on the interval. To prove uniqueness of the closest point y, suppose two 
distinct points y,y' £ L achieve the same minimum distance p. Again, the 
"skinny" triangle property for the triangle y,x,y' implies that points on L 
between y and y' are closer to x than distance p. This is a contradiction, 
so y is unique. □ 

Now suppose we are given a set of points xq,xi, . . . ,Xn £ To- For every 
line L through xq we can obtain the projection yi ,...,?/„ of xi rr^ onto L. 
This defines two functions, f\\{L) and f±{L), which are, respectively, defined 
as the sum of squared distance along L, ^ d{xo,yi)'^, and the sum of squared 
distances perpendicular to L, '^d{xi,yi)'^ . 
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Proposition 2.2. There is a To-Hne through xq which maximizes /y. 
Similarly, there is a To-Hne through xq which minimizes f±. 

Proof. We know from the proof of Proposition 2.1 that given any 
hne L through xq, the points yi are at most distance R from xq, where 
R = niax{d{xo,Xi)}. Let Sr be the sphere {z £ To '■ d{z,xo) = R}. Each 
pair {z,z') & SrX Sr represents a pair of geodesies j{z,xo) and ^{xojz'). 
If d{z,z') = 2R, then necessarily the geodesic between z and z' is exact- 
ly 7(2, Xq) followed by ^{xq,z'), and we say z,z' are antipodal. Every line L 
through xq determines an antipodal pair {z,z'), and since the projected 
points yi all lie on the geodesic between that pair, /y and f± only depend 
on the pair {z,z'). By continuity of the function d:SRxSR^K, the set 
of antipodal pairs is a closed subset of Sr x Sr and is therefore compact. 
It follows that there is a geodesic between antipodal points on Sr which 
optimizes either /y or f±. The geodesic can be extended into a line, and 
that establishes the proposition. □ 

The optimal line may be nonunique for two reasons. First, different ex- 
tensions of the geodesic between an antipodal pair {z,z') might exist. This 
would arise, for example, if all the points xi, . . . , x„ lay in the same topolog- 
ical orthant. Second, as in regular Euclidean PCA, the collection of points 
be isotropic, so that the optimal pair (z, z') is nonunique. 

Given the existence of optimal To-lines, we can now consider how to 
construct a principal line. As outlined in the Introduction, construction of 
the principal line consists of the following steps: 

(1) Given trees xi, . . . ,Xn, construct a "central point" xq. 

(2) Given a line L through xq, "project" xi, . . . , x„ onto L by finding the 
closest point yi in L to Xj for i = 1, . . . , n. 

(3) Find the line L such which optimizes the particular choice of objective 
function / (either /y or f±). 

The details of each of these steps is described in turn, but step 3 forms the 
main challenge. 

2.2. Centroids and consensus. Ideally, xq should be chosen so as to min- 
imize the sum of squared distances: 

(2.1) xo = argmin^^d(x,Xi)^. 

X 

In a Euclidean vector space, this reduces to finding the mean of the data 
xi,...,Xn. In tree-space, due to the lack of additive structure, the mean 
does not make sense, and there is no known closed solution to (2.1). Instead, 
Billera et al. [5] suggest using the centroid, which is defined via a recursive 
procedure based on finding the midpoint along the geodesic between any 
two points. However, for large data sets, this procedure is computationally 
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demanding. We therefore propose taking xq to be the majority consensus 
tree [4]. Finding an "average" or consensus tree is a well-studied problem 
in phylogeny [7] and various forms of consensus tree exist. The majority 
consensus topology consists of splits which are found in strictly more than 
half the trees xi,...,Xn- Branch lengths on xq are assigned their average 
value in the data set: 



for all p G where I{p) is the set {i:p G T^^}- Results obtained later in this 
paper were obtained using this choice of xq. However, construction of the 
principal line L does not rely on any particular properties of the point xq, 
and $PCA works with any sensible choice. 

2.3. "Projection" onto To-Hnes. Given any line L and points xi, . . . ,x„, 
Proposition 2.1 established the existence of closest points yi, . . . ,yn L. Here 
we describe computational aspects of this "projection" onto L. Although 
this relies on existing mathematics, as presented in Section 1, the details of 
an algorithm for projection onto a geodesic path have not previously been 
given. We will assume L{t) is a path-length parameterization of L. For each 
point Xi, Euclidean projection under the embedding into M^'' described in 
Section 1.1 is used to obtain a first guess L{si) for y. Amenta et al. [2] 
showed that the geodesic distance between two points is bounded by the 
Euclidean distance: 



It follows that if £{ denotes the Euclidean distance between Xi and L{si), 
then yi lies on ibej). This bounding interval for yi is used as the starting 
point for a golden-ratio search, which is iterated until some tolerance on y^ 
is achieved. It can be shown that finding yi is a convex optimization, so the 
golden-ratio search is guaranteed to converge. The proof of convexity relies 
on the CAT(O) property and convexity of the equivalent Euclidean prob- 
lem. The algorithm of Owen and Provan [25] is used to calculate geodesic 
distance during the golden-ratio search. However, it is not necessary to re- 
compute geodesies from scratch at every iteration: the sequence of orthants 
for a geodesic at one iteration can often be reused in the next iteration, with 
an associated gain in computational efficiency. 

In Euclidean vector spaces, Pythagoras' theorem gives a decomposition of 
the total sum of squared distances = ^c?(xo,Xj)^ of a collection of points 
into contributions from directions perpendicular and parallel to any given 
line L. However, this decomposition does not apply in To with the geodesic 
metric. Nonetheless, we can evaluate the quantities 




iG/{p) 



Ax- - Aj;||2 < d{x,y) <V2x ||Aa; - Aj^||2. 




and 
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for any metric. It can be shown that for the geodesic metric, unhke the Eu- 
chdean case, the sum of these two quantities depends on L. Despite this, 
when evaluated for a principal path L, the sums of squared distances pro- 
vide a useful quantification of variability, as we demonstrate in the results 
sections. 

2.4. Lines through xq. We need to construct To-lines through xq and 
identify one which optimizes our choice of objective function, /. This is 
a challenging problem which has not previously been considered in the lit- 
erature. In order to achieve computational tractability, we restrict to a par- 
ticular class of 7o-liiies and then employ different optimization algorithms 
to search over the restricted class. To motivate this approach, we start by 
considering properties of lines through xq. 

In the topological orthant containing xq, any To-line L consists of a straight 
line segment. For brevity, we will write Tq for the midpoint topology T^^ 
and Ao for the branch length function X^q- Let Qq denote the orthant con- 
taining Xq, and for now assume that xq is a binary tree (so it contains the 
maximal number of splits). If p is a split contained in Qq, then the branch 
length at a point y{s) £ Qq on L has the form 

(2.2) Aj^(^)(p) = Ao(p) + sx wp, 

where Wp is a "weight" associated to split p, and s lies on some interval 
containing zero. The set of weights determines the direction vector of the 
line segment through xq. Given such a line segment, we need to know how 
it might extend beyond Qq into the rest of tree-space. 

Where the segment meets a face of Qq, at least one split is assigned zero 
branch length. Generically, the line segment will meet a co-dimension 1 face 
of Qq, so that just one split p will have zero length. Solving equation (2.2) 
for this split gives 

(2.3) \is){p) = ^ s = -\Q{p)/wp. 

The line then extends from this point into one of the neighboring alternative 
orthants. In a similar way, every other split whose length varies in the initial 
line segment containing xq is associated with a solution of equation (2.3) 
and, correspondingly, with an alternative split related to the first by NNI. 
If we restrict to the set of "generic" lines (those which always meet a co- 
dimension 1 face of every orthant), then finding the optimal line L therefore 
consists of a topological problem (namely, choosing a new split p' to replace 
each p) and a geometrical problem (finding the best set of weights Wp). 
However, these problems are not independent. We can order the solutions 
to (2.3) as we move out from xq in a particular direction along L. Suppose the 
first solution we come to is at s = si and we replace split pi with At the 
next solution s = S2, split p2 is assigned zero length and we replace it via an 
NNI move. However, the choice of splits available as replacements for p2 does 
not depend solely on p2 but also on the rest of the tree topology just before 
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s = S2 — and therefore potentially on the choice of replacement p'^ ofpi. Thus, 
the topological aspect of construction depends on the relative order of the 
solutions to (2.3), which in turn depends on the weights Wp. Optimization 
over the set of possible weights and splits will be computationally demanding 
for trees with more than a few species — an exhaustive search will not be 
possible. 

A key feature of the description above is the assumption that line segments 
meet the boundary of orthants in co-dimension 1 faces. We restrict our search 
space for L similarly, but take into account the possibility that xq might not 
be fully resolved. We make this more formal as follows. 

Definition 2. Suppose p £ S is compatible with xq and p' G S is ob- 
tained by extended nearest neighbor interchange of p in Tq U {p}. The simple 
line through xq associated with p,p' and weight w is the path y{s) E To de- 
fined by 

\{s){p) = ^o{p) + sw if Ao(p) + su; > 

= otherwise, 
\{s){p') = -(Ao(p) + sw) if Ao(p) + sw<0 

= otherwise, 
^y{s){q) = Ao(g) if g ^ {P,p'}- 

Such a path moves through a single pair of orthants. The next definition 
extends simple lines to pass through more than two orthants. 

Definition 3. Suppose x{s) is the simple line through xq defined by 
split pairs {pi,p[), . . . , {pk,p'k) and weights wi, . . . ,Wk, and suppose that the 
pair of splits {pk+i,p'k+i) ™d weight Wk+i G M satisfy the following: 

(i) Pk+i is compatible with x{s) for all s such that Ao(pfc+i) + swk+i > 0, 

(ii) p'j^^i is compatible with x{s) for all s such that Ao(pfe+i) + swk+i < 0, 

(iii) Pk+i,p'k+i are related by XNNI in x{sk+i) where Sk+i = -Ao(pfc+i)/ 

Wk+l- 

Then the simple line y(s) defined by {pi,p'i), ■ ■ ■ , {pk+i,p'k+i) and weights 
wi,...,Wk+i is given by 

\(s)iPi) = ^oiPi) + swi if Xo{pi) + swi>0 

(2.4) 

= otherwise, 
K(s)ip'i) = -(Ao(k) + swi) if Xo{Pi) + swi<0 

(2.5) 

= otherwise, 

(2.6) Xyis){q) = HQ) ifg^fe}u{pa, i = i,...,k + i. 
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To prove that simple lines satisfy the conditions of Definition 1, Proposi- 
tion 4.2 of [5] can be applied to any pair of points on a simple line in order 
to show that the subpath between those points is geodesic. 

Simple lines through xq resemble the geodesic between xi and yi in Fig- 
ure 2: they continue between orthants without bends, and hence are always 
locally geodesic. Moreover, any straight line segment through xq can be ob- 
tained as part of a simple line. Nonetheless, restriction to the class of simple 
lines removes many lines from consideration. Cone paths are ruled out, to- 
gether with any geodesic for which some subset of the splits changes like 
a cone path. (This latter class of geodesies resembles xi — yi in Figure 2 for 
some splits and X2 — ?/2 for others.) This restriction is carried out for the 
sake of computational tractability. More discussion is given in Section 6. 

Definition 3 describes how to extend a simple line on k split pairs to 
one on fc + 1 split pairs. Our algorithms for finding an optimal simple 
line are based precisely on this operation. Suppose a simple line L is de- 
termined by sets of splits P = {pi, . . . ,pk},P' = {p'l, . . . ,p'j^} and weights 
W = {wi, . . . ,Wk}- Conditions (i)-(iii) of Definition 3 place constraints on 
any proposed splits p,p' and weight w which might be used to extend L. 
The values Si = —\Q{pi)/wi correspond to points at which L crosses the 
boundary between orthants, and we can assume they are ordered with 
si < S2 < ■ ■ • < Sk- They divide L up into A; + 1 intervals Ij = [sj,Sj+i] for 
i = 0, 1, . . . , A; taking sq = — oo and s^+i = oo, such that the topology of L 
is constant on each interval. Let U denote the tree topology on /j. Now 
suppose that p is compatible with Tq and p' is a proposed replacement 
for p. Suppose we also propose an interval /j on which we require the XNNI 
move to be performed. Conditions (i)-(iii) are then equivalent to the follow- 
ing. 

Geometrical constraint: 

Si < < Sj+i, so the XNNI move occurs on interval /j. 

w 

Topological constraint: 

• If t/; < 0, then p must be compatible with to, . . . ,ti and p' must be com- 
patible with ti\p, . . . , tk+i \ p; or 

• if w > 0, then p' must be compatible with to\p, . . . ,ti\p and p must be 
compatible with ti,. . . , t^+i- 

When p is not contained in Tq, but instead extends the midpoint topology, 
then \o{p) = and the solution to equation (2.3) is s = 0. In this case, the 
geometric constraint corresponds to an unbounded interval for w, and the 
interval li on which the XNNI move p ^ p' takes place must necessarily 
contain s = 0. 

2.5. Greedy algorithm for finding an optimal simple line. The following 
algorithm repeatedly extends a simple line by adding in a new split pair at 
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each iteration. The pair chosen is the one which gives the best improvement 
in the objective /: 

(1) Let F be the set of feasible sphts (see below). 

(2) Consider in turn every split p va. F that is compatible with Tq, and 
every possible replacement p' for p. 

(3) For each interval /j, test whether p' is an XNNI replacement of p in tj. 

(4) If (ii) holds on interval /j, then next check whether the pair p,p' 
satisfies either topological constraint for that interval. Fix the sign of w 
depending on which constraint applies. 

(5) If either topological constraint holds, then find w that maximizes the 
variance of the projected points on L, subject to the geometrical constraint 
and sign of w. This is carried out using the golden ratio search for the 
optimum value of w. 

(6) Repeat for all feasible pairs p,p' £ F and find the pair that gives the 
maximum projected variance. 

(7) Add p, p' and w to the lists P, P' and W, and reorder the lists 
according to the solutions of (2.3). Remove p,p' from F, and repeat from 
step 2. 

The algorithm continues until no more splits can be added to L. This will 
occur in at most m — 3 iterations where m is the number of species, since 
every tree can contain at most m — 3 nontrivial splits. 

The set of feasible splits F could be taken to be the entire set of possible 
splits S, but this is inefficient. If neither split p,p' is contained in any of the 
trees xi,. . . ,Xn, then adding the pair to L will only increase the distances 
d{xi,yi) so that L is a worse approximation to the data. We therefore take F 
to be the set of nontrivial splits found in at least one tree xi, . . . ,Xn- It is 
possible that at some stage the best improvement in / might be given by 
some p £ F and a replacement p' ^ F (e.g., consider the case that all the 
trees Xi lie in the same orthant). However, in such a situation, the data 
would not be informative about the choice of p' , and so we disregard this 
possibility. 

The greedy algorithm terminates after at most m — 3 iterations. During 
each iteration Odi^^p) pairs of splits and 0{m) possible intervals for the 
move p^ p' are considered. For each pair of splits and interval, n trees are 
projected onto the proposed line. Each projection requires O(m^) steps. The 
golden ratio search during projection is performed to a fixed tolerance, and 
so is independent of m, n and \F\. Overall, the algorithm therefore requires 
0{m^ X n X l-Fp) steps where F is at worst 0{nm). 

2.6. Monte Carlo optimization algorithm. A Monte Carlo optimization 
algorithm was also implemented in order to provide comparisons with the 
greedy approach. A simulated-annealing type approach was adopted, where 
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the state at each iteration comprised a simple hne L through xq. At each it- 
eration a "birth" or "death" move was randomly proposed from the current 
state. Birth moves consisted of adding a valid split pair to L, while death 
moves consisted of removing a split pair from L. Birth moves were obtained 
by selecting p & F uniformly at random, then selecting p' uniformly at ran- 
dom from the possible XNNI replacements of p satisfying the constraints 
defined above. The weight assigned to p,p' was obtained by the golden ra- 
tio search, as for the greedy approach. Death moves were carried out by 
choosing at random the split pair at either end of L (i.e., with largest pos- 
itive or negative Sj) and removing it. Removing other split pairs results in 
incompatible sets of splits along L and is therefore forbidden. The relative 
probabilities of birth and death depended on the number k of split pairs in L 
and were designed to favor birth for small k and death when k was large. 
Proposals leading to improvement in the objective were always accepted. 
Other proposals were accepted with probability 



where 5 is the absolute difference of the objective for the proposed and 
current state, D is a bound for 6, and r is the "temperature." For f = f±, 
D was taken to be ^ d{xo,Xi)'^, while for f = f\\, D was taken to be /y for the 
current state. The temperature r was slowly decreased as the optimization 
progressed. 

2.7. Branch length transformations. We investigated certain transforma- 
tions of the data xi, . . . , prior to analysis with <I>PCA. Kupczok et al. [22] 
suggest scaling each tree xi,...,Xn to have the same total branch length. 
In practice, this seemed to make little difference to the examples we looked 
at in the results section below. Instead we considered the following branch 
length normalization. For each split p, branch lengths were scaled by a con- 
stant so that the average branch length associated with p across the whole 
data set was unity. This was repeated for each split in the data set. The idea 
behind this is to make $PCA measure variability relative to branch length 
and to amplify the variability in short branches. In regular PCA the cor- 
relation matrix can be analyzed instead of the covariance matrix, and this 
branch length transformation can be thought of as being analogous to the 
correlation matrix version. Principal geodesies obtained for branch-length 
normalized data can be back-transformed onto the original scale by scaling 
the weights W . 

3. Relationship to the work by Wang and Marron. Wang and Mar- 
ron [29] previously developed PCA in a space of trees, and on account of 
the similarities of our approach to theirs, in this section we look in detail 
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at the relationship between the approaches. The steps underlying our ap- 
proach specified at the start of Section 2 were taken directly from [29], but 
the details of how these steps are carried out are quite different on account 
of the different geometries under consideration. 

In [29] rooted bifurcating trees are considered, but, unlike phylogenetic 
trees, the leaf vertices are not assigned taxon labels. Instead, each vertex 
can have a "left" and a "right" descendant, and trees in the data set can 
have different depths from root to leaf. Most importantly, branches do not 
have any associated length, but, instead, each vertex present in a tree has 
an associated real number (or vector). An example of such data consists of 
blood vessel information from medical imaging: vertices represent blood ves- 
sels, edges represent connections between blood vessels, and the data value 
associated to each vertex corresponds to some measurement at that point 
in the blood vessel structure. One crucial difference between the two spaces 
of trees is that in [29] there is no relationship between the values associated 
to vertices and the topological structure of the tree. This is different from 
the space To, in which branch lengths can be shrunk down and replaced by 
an alternative topology. 

This separation of "topological" and "geometrical" aspects of the prob- 
lem in [29] results in principal components with separate topological and 
geometrical parts. In Wang and Marron's terminology, a structure tree line 
is a sequence of vertices, each descended from the previous vertex, which 
can be thought of as (discontinuous) "growth" of a tree toward a leaf, by 
grafting on branches. In contrast, an attribute tree line consists of a fixed 
tree structure with "direction vectors" associated to vertices. This is clearly 
very different from the lines constructed by $PCA in which the principal 
path reflects both topological and geometrical variability in the data set. 

Not surprisingly, given the different structures of the spaces considered, 
the metrics used in the two approaches differ. The metric in [29] is a linear 
combination of the (unweighted) Robinson-Foulds metric [27] and a Eu- 
clidean distance between the vectors associated to each vertex. This metric 
is inexpensive to compute, in contrast to the geodesic metric which we con- 
sider, and this reduces the computational burden of their approach relative 
to ours. The midpoint xq in [29] is taken to have the majority consensus 
topology [4], as used here, since this minimizes the sum of the Robinson- 
Foulds distances of the midpoint from xi, . . . However, while Wang and 
Marron obtain an exact form of Pythagoras' theorem with their metric, that 
is not the case for $PCA (see Section 2.3). 

In summary, the method of Wang and Marron cannot be applied directly 
to phylogenetic trees, since the trees they consider lack taxon labels and 
branch lengths and so cannot represent phylogenies. While our approach 
builds on the same framework as that laid out in [29], differences in the 
geometries of the spaces under consideration make the mathematical details 
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of the implementation of PCA substantially different. In particular, <I>PCA 
relies heavily on the geometry of To described by Billera et al. [5]. It is inter- 
esting to note how a seemingly small difference in the geometry of the space 
under consideration can substantially change the way PCA is implemented. 

4. Simulation studies. 

4.1. Simple mixtures. <I>PCA was used to analyze collections of randomly 
generated trees with two (or more) known underlying topologies. These sim- 
ulations were not intended as a model of a specific process giving rise to al- 
ternative trees, but were performed in order to verify the methodology and 
demonstrate how it works on simple examples. We describe the simulations 
very briefly here, but give more details in the supplementary material [24]. 
Two sets of simulations were performed. In the first set, trees were simulated 
such that each had one of two possible topologies ti or t2. The underlying 
topologies ti,i2 were related by an NNI move and represented alternative 
positions for a clade within the tree. Topology ti was adopted with proba- 
bility 9 and t2 with probability 1 — 9. Apart from branches affected by the 
change in topology, all other branch lengths were kept fixed. For each value 
of 9, 100 trees were randomly generated in this way. Additional variability 
was added by simulating a DNA alignment for each tree, and then replac- 
ing the tree with the maximum likelihood (ML) tree estimated from the 
alignment. <I>PCA was used to analyze these estimated trees. A second set 
of simulations was performed in which there were two correlated changes in 
topology. Each tree consisted of two subtrees, and each subtree had either 
topology ti or t2 as in the first set of simulations. The alternative topologies 
in each half of the tree were simulated to arise with correlation p. Again, ad- 
ditional variability was added by simulating alignments and replacing each 
tree with an ML estimate. 100 trees were generated for each pair of values ^, p 
and $PCA was used to analyze each set of estimated ML trees. 

The results indicated that optimization of /y gave the best performance: 
paths obtained by optimizing /_|_ sometimes missed the changes in topology 
imposed in the data sets. In this non-Euclidean setting the sum (i| -|- <f\_ is 

generally less than the total sum of squared distances dg, so optimization 
of /j_ may result in principal paths that fail to capture variability in the 
data by finding paths in which both sums (i| and dP\_ are small. 

In both sets of simulations, $PCA with /y gave principal paths corre- 
sponding to the change between the imposed alternative topologies. In the 
first set of simulations, based on a single pair of alternative topologies, as 9 
increased the change between the underlying topologies ii,t2 dominated the 
principal path (the corresponding splits received a higher weight) as variabil- 
ity due to tree estimation from alignments was dominated by the imposed 
variability in topology. In the second set of simulations, for small values of p 



PRINCIPAL COMPONENTS FOR PHYLOGENIES 



19 



Table 1 

Results of the LBA simulations. The largest two weights w for split pairs and 
a description of the corresponding changes in topology are given. Weights were 
normalized to have unit Euclidean norm 

w Change in topology 

Without normalization (dy/do ~ 2.6%) 
0.859 Guilardia moves past pairing with Rhodophyta to top of clade with plants 

0.152 Guilardia moves from top of clade with plants to position closer to Archaea 

Branch lengths normalized (dy/dg = 10.3%) 

0.706 Microsporidia moves from grouping with fungi to top of clade with Metazoa 

0.482 Microsporidia grouped with Archaea 



the principal path corresponded to change between the alternative topologies 
on one part of the tree, with the other pair of alternative topologies receiv- 
ing a low weight. For larger p, the correlated alternatives in both parts of 
the tree were identified by the principal path. More details are given in the 
supplementary material [24]. 

4.2. Long branch attraction. In order to demonstrate a potential appli- 
cation of <1>PCA, a simple study of long branch attraction (LBA) was per- 
formed. LBA is a feature of phylogenetic methods in which species on long 
branches are often grouped together erroneously on estimated trees. We 
took a tree from the literature [6] representing a deep phylogeny of eukary- 
ote species which includes two long branches and a distant out-group, as 
shown in Figure 3. 100 trees were simulated by first simulating amino acid 
alignments from the base tree (300 base pairs, WAG+4r model) using the 
seq-gen software [26] and then obtaining an ML estimate tree from each 
alignment using phyML [17]. <I>PCA was used to analyze the set of trees 
estimated from the simulated alignments. 

Analysis of the simulated trees was carried out first with un-normalized 
data and then again with the normalization procedure described in Sec- 
tion 2.7. Optimization was carried out using the /y objective function, and 
results were obtained with both the greedy and Monte Carlo algorithms. De- 
spite long runs, the Monte Carlo algorithm failed to improve on the results 
obtained with the greedy algorithm. The results obtained with the greedy 
algorithm are shown in Table 1 and Figure 3 shows the principal path ob- 
tained with normalized branch lengths. The "proportion of variance" d|/dQ 
was greater with the normalized data, and so we suggest that normalization 
is preferable for this data set. 

As explained in [6] , estimated trees tend to place the long branches ( Guil- 
lardia and Microsporidia) next to the out-group (Archaea). The analyses of 
both the un-normalized and normalized data show this effect with, respec- 
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Fig. 3. Simulation study of LBA. (a) Underlying tree with two long branches and distant 
out-group (archaea, on the left), (b)-(e) Trees along the principal path. Branches were 
normalized to have unit mean. No back-transform to the original scale was performed, since 
this obscured the visual effect. Arrows highlight the microsporidia group (labeled "microsp") 
moving round to join the outgroup (labeled "cn-arch" and "eu-arch"). 
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lively, Guillardia and Microsporidia "floating" round the tree to be placed 
closer to the Archaea. The fact that each principal path captures a single 
such effect suggests that the attraction of the two long branches to the Ar- 
chaea is uncorrelated in the data. $PCA exactly captures the expected LBA 
artefact in the simulated data. 

5. Analysis of metazoan data. <^PCA was applied to a set of 118 gene 
trees from 21 metazoan (animal) species, previously analyzed in [22]. $PCA 
was performed on both unsealed and branch length normalized data using 
the /|| objective function. The Monte Carlo optimization algorithm obtained 
principal paths with slightly higher /y scores than the greedy algorithm, and 
so we refer to that set of results here. The principal paths obtained with 
the two algorithms were similar, and shared the majority of split pairs in 
common. The "proportion of variance" (iy/dg was 1.8% for the unsealed data 

and 4.6% for the normalized data — relatively low in both cases. However, 
the simulation studies produced similarly low scores (between 3% and 5% 
on artificial data), suggesting that low scores might be common even when 
<I>PCA is successfully capturing aspects of variability in the data. Further 
comments about the low proportion of variance are made in Section 6. 

The principal path obtained for the unsealed data corresponded to uncer- 
tainty in the positioning of the out-group, yeast. It moves from being placed 
next to the worms to being grouped with sea squirt. This might be an LBA 
effect since sea squirt and yeast lie on relatively long branches. Results of 
the analysis using data with normalized branch lengths are shown in Fig- 
ure 4. The principal path indicates uncertainty in the placement of Human: 
it is either grouped with Chimpanzee or Macaque. The position of Human 
relative to its neighbors was a longstanding problem in phylogenetics [15]. 
Uncertainty in the positioning arises from the presence of a relatively short 
internal branch in the species tree joining Human and Chimpanzee to the 
other primates. Although well known in evolutionary biology, this simple 
example illustrates how $PCA can be used to identify and visualize alter- 
natives within a set of trees. 

6. Conclusion. We have presented a procedure for identifying principal 
paths in the space of phylogenetic trees which best approximate a set of 
alternative phylogenies in an analogous way to standard PCA in Euclidean 
vector spaces. A key feature of the approach is the use of metrics that com- 
bine geometric and topological information about trees. The principal paths 
constructed coincide with the geodesic between every pair of points on the 
path. Each principal path is equipped with a summary statistic analogous 
to the Euclidean proportion of variance which quantifies variability along 
the path. 

Results obtained from simulated and experimental data sets gave values 
for the "proportion of variance" d^Jd^ which were relatively low in compar- 



22 



T. M. W. NYE 




Fig. 4. Trees along the principal path for the normaUzed metazoan data m order (a)-(d). 
(a) corresponds to the majority consensus topology. Human moves from being grouped with 
Chimp (labeled "pantro") to Macaque (labeled "mmulatta"), as highlighted by the solid 
arrows. 

ison to typical values for standard PC A (e.g., about 5% for the normalized 
metazoa data set). This is the result of two features of the problem. First, the 
data sets analyzed in this paper are high dimensional (containing over 100 
different splits), and in the same way as for standard Euclidean data, this 
tends to lead to lower proportions of variance. To illustrate this, consider, for 
example, a multivariate normal distribution with dimension 100 and covari- 
ance matrix diag(5(T^, u^, . . . , cr^). Standard PCA would give a proportion of 
variance of roughly 5% even though the variance along the principal com- 
ponent is substantially higher than in other directions. Second, the failure 
of Pythagoras' theorem in tree-space means that for any analysis variance 
"leaks out," that is, the sum of squares (i| -|- is less than the total sum of 

square distances for the original data d^, further decreasing d^^/d^. 

In order to construct principal paths, two approximations have been im- 
posed: 

(1) A greedy algorithm or Monte Carlo search is carried out in the con- 
figuration space of paths in order to find the optimal path. There is no 
guarantee that the optimal path will always be found. 
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(2) The configuration space itself is restricted to a subclass of paths (re- 
ferred to as simple lines). By restricting in this way we might rule out cap- 
turing types of variability in the data under analysis. 

We consider the second of these approximations to be more limiting, and 
it is difficult to generalize the approach we have described to overcome 
it. Another area requiring further research is the construction of higher- 
dimensional approximations to data, analogous to second and third, etc. 
principal components. Construction of higher order principal components 
in Euclidean vector spaces is carried out by working orthogonally to the 
first principal component. Tree-space is not equipped with an inner prod- 
uct, and so this procedure cannot carry over directly to To- Our algorithms 
for constructing the principal path L do not therefore readily generalize to 
give higher order paths. Instead, we would need to consider two-dimensional 
subsets of To which approximate the data xi,. . . ,Xn as closely as possible. 
In analogy to the definition of To-lines, we would require that such sub- 
sets n contain the geodesic between any two points in 11, and so 11 would 
locally resemble a plane in each orthant. However, in contrast to the the- 
ory of geodesies, the theory of higher-dimensional surfaces in tree-space is 
not well developed. We have not attempted to advance this theory in this 
paper, but have focused on the already considerable problem of identifying 
lines which best approximate the data. 

The $PCA procedure has been presented as an empirical analysis of sam- 
pled trees without reference to any underlying distribution that generated 
the trees. Distributions such as sampling distributions, bootstrap distribu- 
tions and Bayesian posteriors are of fundamental importance in phyloge- 
netic inference, but the geometrical properties of these distributions have 
received little study. Billera et al. [5] considered spherically symmetric dis- 
tributions with density decaying exponentially away from a central point. 
A second form of isotropic distribution consists of the limit of a random 
walk in tree-space from a fixed central point. By simulating samples from 
a suitable random walk and carrying out $PCA on the samples, an empiri- 
cal p- value could be assigned to the proportion of variance of a principal line 
constructed from experimental data, as a test for significant departure from 
isotropy. Holmes [20], however, suggests that the assumption of spherical 
symmetry is not realistic for most distributions of interest. One area where 
distributions on tree-space have been defined more precisely is the study of 
population-genetic effects on gene phylogenies [9]. Such distributions could 
be studied in the context of tree-space geometry, and it might be possible 
to obtain the sampling theory of principal lines under $PCA in this case. 

This paper has presented the results of applying $PCA to some relatively 
simple examples, and demonstrated the type of information principal paths 
reveal. The method can be applied to larger data sets and it has the poten- 
tial to provide new insights into a range of problems in evolutionary biology. 
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Software for performing $PCA and for visualizing principal paths as anima- 
tions of trees is available in the supplementary material [24], together with 
the data sets analyzed in this paper. 
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SUPPLEMENTARY MATERIAL 

Principal components analysis in the space of phylogenetic trees: Supple- 
mentary information (DOI: 10.1214/11-AOS915SUPP; .pdf). This contains 
further information about the simulation studies in Section 4.1. 
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